This report is submitted by Greeshma Jeev Koothuparambil and Olayemi Morrison as a part of Laboratory 1 of Visualization (732A98) Course for the 2023 Autumn Semester.

Assignment 1

The modified Tree using InkScape

The modified Tree using InkScape


Assignment 2

Following are the libraries used for the successful completion of this assignment:
ggplot2
gridExtra
dplyr
cowplot
ggpubr
plotly
shiny

Here is how we loaded our libraries:

library(ggplot2)
library(gridExtra)
library(dplyr)
library(cowplot)
library(ggpubr)
library(plotly)
library(shiny)

1. Read data from SENIC.txt into R

#reading the file and renaming the columns for easy readability.

df <- read.table("SENIC.txt")
description <- c("ID", "Length_of_Stay", "Age",
                 "Infection_Risk", "Routine_Culturing_Ratio", "Routine_Chest_Xray_Ratio",
                 "Number_of_Beds", "Medical_School_Affiliation", "Region",
                 "Average_Daily_Census", "Number_of_Nurses", "Available_Facilities_And_Services")
colnames(df)<- description

The loaded dataframe looks like this:

ID Length_of_Stay Age Infection_Risk Routine_Culturing_Ratio Routine_Chest_Xray_Ratio
1 7.13 55.7 4.1 9.0 39.6
2 8.82 58.2 1.6 3.8 51.7
3 8.34 56.9 2.7 8.1 74.0
4 8.95 53.7 5.6 18.9 122.8
5 11.20 56.5 5.7 34.5 88.9
6 9.76 50.9 5.1 21.9 97.0
Number_of_Beds Medical_School_Affiliation Region Average_Daily_Census Number_of_Nurses Available_Facilities_And_Services
279 2 4 207 241 60
80 2 2 51 52 40
107 2 3 82 54 20
147 2 4 53 148 40
180 2 1 134 151 40
150 2 2 147 106 40

It has 113 observations and 12 variables namely:
ID, Length of Stay , Age ,Infection Risk , Routine Culturing Ratio , Routine Chest Xray Ratio ,Number of Beds, Medical School Affiliation , Region ,Average Daily Census , Number of Nurses , Available Facilities And Services.

The Summary of the table is as follows:

##        ID      Length_of_Stay        Age        Infection_Risk 
##  Min.   :  1   Min.   : 6.700   Min.   :38.80   Min.   :1.300  
##  1st Qu.: 29   1st Qu.: 8.340   1st Qu.:50.90   1st Qu.:3.700  
##  Median : 57   Median : 9.420   Median :53.20   Median :4.400  
##  Mean   : 57   Mean   : 9.648   Mean   :53.23   Mean   :4.355  
##  3rd Qu.: 85   3rd Qu.:10.470   3rd Qu.:56.20   3rd Qu.:5.200  
##  Max.   :113   Max.   :19.560   Max.   :65.90   Max.   :7.800  
##  Routine_Culturing_Ratio Routine_Chest_Xray_Ratio Number_of_Beds 
##  Min.   : 1.60           Min.   : 39.60           Min.   : 29.0  
##  1st Qu.: 8.40           1st Qu.: 69.50           1st Qu.:106.0  
##  Median :14.10           Median : 82.30           Median :186.0  
##  Mean   :15.79           Mean   : 81.63           Mean   :252.2  
##  3rd Qu.:20.30           3rd Qu.: 94.10           3rd Qu.:312.0  
##  Max.   :60.50           Max.   :133.50           Max.   :835.0  
##  Medical_School_Affiliation     Region      Average_Daily_Census
##  Min.   :1.00               Min.   :1.000   Min.   : 20.0       
##  1st Qu.:2.00               1st Qu.:2.000   1st Qu.: 68.0       
##  Median :2.00               Median :2.000   Median :143.0       
##  Mean   :1.85               Mean   :2.363   Mean   :191.4       
##  3rd Qu.:2.00               3rd Qu.:3.000   3rd Qu.:252.0       
##  Max.   :2.00               Max.   :4.000   Max.   :791.0       
##  Number_of_Nurses Available_Facilities_And_Services
##  Min.   : 14.0    Min.   : 5.70                    
##  1st Qu.: 66.0    1st Qu.:31.40                    
##  Median :132.0    Median :42.90                    
##  Mean   :173.2    Mean   :43.16                    
##  3rd Qu.:218.0    3rd Qu.:54.30                    
##  Max.   :656.0    Max.   :80.00

2. Create a function that for a given column (vector) X does the following:
a. It computes first and third quartiles Q1 and Q3 with quantiles()
b. It returns indices of outlying observations, i.e. observation with X-values greater than Q3+1.5(Q3-Q1) or less than Q1-1.5(Q3-Q1).

#defining outliers function
  ## outlier function takes dataframe column name as input and returns indices of outliers.

outliers_fun <- function(X){
    quant <- quantile(X)
    lessout<- quant[2]-1.5*(quant[4]-quant[2])
    greatout  <- quant[3]+1.5*(quant[4]-quant[2])
    outlier_ind <- row.names(df[X<lessout|X>greatout,])
    return(outlier_ind)
}

3. Use ggplot2 and the function from step 2 to create a density plot of Infection risk in which outliers are plotted as a diamond symbol ( ◊ ) . Make some analysis of this graph.

#Density Plot
  ## plotting for Infection Risk

vals <- as.numeric(outliers_fun(df$Infection_Risk))
outliers <- df$Infection_Risk[vals]
outliersdf <- as.data.frame(outliers)
#outliersdf$outliers <- as.factor(as.character(outliersdf$outliers))

pl <- ggplot(df, aes(x= Infection_Risk)) + geom_density(fill ="blue", alpha =0.5)+ylab("Density")
pl <- pl + geom_point(data =outliersdf, mapping=aes(x= outliers, y = 0), colour = "Red", shape =5)

The plot looks like this:

Analysis

The highest point in our density plot of Infection risk falls around 4.5, indicating that this is where the majority of the US hospitals fall. This means that approximately 40% of hospitals have an infection risk rate of_4.5.The hospitals that have extremely low infection risk below 2 and extremely high infection risk of 8 are considered outliers. Their results are significantly different from the majority of the hospitals. It could be as a result of any number of factors such as age, number of nurses, available resources and so on. But an interesting pattern to be observed is the predifined outliers function does not explain the density dip at the graph. Absence or lower density of values can also be considered a case of outliers which is not identified by the outlier function which uses quartiles to define outliers.


4. Produce graphs of the same kind as in step 3 but for all other quantitative variables in the data (aes_string() can be useful here). Put these graphs into one (hint: arrangeGrob() in gridExtra package can be used) and make some analysis.

  ##Plotting for rest of the data
columns <- as.integer(c(2:ncol(df)))
columns <- columns[-which(colnames(df[columns]) %in% c("Infection_Risk","Medical_School_Affiliation", "Region"))]
loopplotlist <- list()
j=1

for (i in columns) {
  loopvals <- as.numeric(outliers_fun((df[,i])))
  loopoutliers <- df[loopvals,i]
  loopoutliersdf <- as.data.frame(loopoutliers)
 # loopoutliersdf$loopoutliers <- as.factor(as.character(loopoutliersdf$loopoutliers))
  
  loopplot <- ggplot(df, aes_string(x= names(df[i]))) + geom_density(fill ="blue", alpha =0.5)+ ylab("Density")
  loopplotlist[[j]] <- loopplot + geom_point(data =loopoutliersdf, mapping=aes(x= loopoutliers, y = 0), colour = "Red", shape =5)
  j=j+1
}
graphs <- arrangeGrob(grobs = loopplotlist[], ncol = 2,nrow = 4)
p <- as_ggplot(graphs)

The generated grid plot looks like ths:

Analysis

A similar shape/pattern can be identified in the following factors: Length of Stay, Routine Culturing Ratio, Number of beds, Average Daily Census, and Number of nurses. These graphs are skewed towards the left. It can be deduced in these graphs that, the lower the value of these factors, the higher the infection risk in those hospitals. However, for other factors such as Age and Available Facilities & Services, the infection risk falls in the middle.


5. Create a ggplot2 scatter plot showing the dependence of Infection risk on the Number of Nurses where the points are colored by Number of Beds. Is there any interesting information in this plot that was not visible in the plots in step 4? What do you think is a possible danger of having such a color scale?

# Dependence plot between Infection Risk, Number of Nurses and number of Beds
dependence<-ggplot(df,aes(x=Infection_Risk, y=Number_of_Nurses, color=Number_of_Beds)) + 
  geom_point()

The resulting graph is as follows:

Analysis

There are some outliers in the scatter plot that shows hospitals with a large number of beds and nurses (approximately over 600 each), yet they still have a relatively high infection risk above 4.5. On the lower end of the plot, a hospital with fewer beds has a lower infection risk. Having the color go lighter as the number of beds increases makes it easy to misinterpret the data, as it is easy to assume that the opposite is the case.


6. Convert graph from step 3 to Plotly with ggplotly function. What important new functionality have you obtained compared to the graph from step 3? Make some additional analysis of the new graph.

## converted to plotly
plplotly <- ggplotly(pl)

l1 <- ggplotly(loopplotlist[[1]])
l2 <- ggplotly(loopplotlist[[2]])
l3 <- ggplotly(loopplotlist[[3]])
l4 <- ggplotly(loopplotlist[[4]])
l5 <- ggplotly(loopplotlist[[5]])
l6 <- ggplotly(loopplotlist[[6]])
l7 <- ggplotly(loopplotlist[[7]])
l8 <- ggplotly(loopplotlist[[8]])

loopplotlistplotly <- subplot(l1, l2, l3, l4, l5, l6, l7, l8, nrows = 4, margin = 0.05)%>%
  layout(title = "DENSITY PLOTS ON VARIOUS FACTORS AT HOSPITALS")

dependenceplotly<- dependence + ggtitle("DEPENDENCY OF INFECTION RISK ON THE NUMBER OF NURSES\n
                                       SCALED ALONG THE NUMBER OF BEDS")
dependenceplotly <- ggplotly(dependenceplotly)

The following are the graphs obtained:

Analysis

After converting to Plotly, the scatter plot becomes more interactive. More details are shown as you hover over each dot. With this new feature, we can further deduce that the outliers mentioned previously have over 600 nurses, and 800 beds. It is also shown that majority of hospitals have the number of beds and nurses below 200.


7. Use data plot-pipeline and the pipeline operator to make a histogram of Infection risk in which outliers are plotted as a diamond symbol ( ◊ ) . Make this plot in the Plotly directly (i.e. without using ggplot2 functionality). Hint: select(), filter() and is.element() functions might be useful here.

#pipeline

histodf <- filter(df,is.element(df$Infection_Risk, outliers)) %>% 
  select('Infection_Risk')

histplot <- df %>% plot_ly(x=~Infection_Risk, type = "histogram") %>%
  add_markers(x = ~jitter(as.numeric(histodf$Infection_Risk)),
              y = 0,
              marker = list(size = 7, symbol= 'diamond-dot'),
              showlegend = FALSE) %>% layout(title = 'HISTOGRAM ON INFETION RISK OF HOSPITALS', yaxis = list(title = 'Frequency'))

The histogram obtained is as follows:

Analysis

The histogram makes sense on the dip visible around value 7 in the Density plot obtained above. The graph shows that no data has been received on an Infection Risk that falls between 6.9 to 7.5. it also shows that there is an increased visibility of Infection Risk between 2.5 and 2.9 compared to its neighbouring values.


8. Write a Shiny app that produces the same kind of plot as in step 4 but in addition include: a. Checkboxes indicating for which variables density plots should be produced b. A slider changing the bandwidth parameter in the density estimation (‘bw’ parameter)

# Define UI 
ui <- fluidPage(
    checkboxGroupInput("variable", "Variables to show:",
                       c("Length_of_Stay", "Age",
                         "Routine_Culturing_Ratio", "Routine_Chest_Xray_Ratio" ,
                         "Number_of_Beds", "Average_Daily_Census" , "Number_of_Nurses" ,
                         "Available_Facilities_And_Services"), 
                       selected = "Average_Daily_Census"),
    sliderInput(inputId="ws", label="Choose bandwidth size", value=0.1, min=0.1, max=1),
    tableOutput("data"),
    plotOutput("outplot")
)

#Define server
server <- function(input, output, session) {
    output$data <- renderTable({
        df[1:10,input$variable]
    }, rownames = TRUE,)
    
    output$outplot <- renderPlot({
        dftemp <-  as.data.frame(df[,input$variable])
        colnames(dftemp) <- "data"
        outlierdf <- as.data.frame(dftemp[outliers_fun(dftemp[,1]),])
        colnames(outlierdf) <- "outliers"
        ggplot(dftemp, aes(x= data)) +
            stat_density(alpha=0.8, bw=input$ws, position="identity")+
            ylab("Density")+xlab(input$variable)+
            geom_point(data =outlierdf, mapping=aes(x= outliers, y = 0), colour = "Red", shape =5)
    })
}


# Run the application 
#shinyApp(ui = ui, server = server)
Analysis

The higher the bandwidth, the smoother the graph becomes. Graphs that contain more data, for example, Number of beds or Number of Nurses, a higher bandwidth ranging from 2.5 to 3.5 would be recommended, as it is practically unreadable when the bandwidth is set to anything less than 2. For graphs with less data, for example Length of Stay or Age, a lower bandwidth ranging from 0.25 to 0.35 is preferred, as anything higher would cause some data to be lost.


STATEMENT OF CONTRIBUTION

For the first assignment since we both are new to the InkScape Software, we both sat simultaneously and learned on various tools of InkScape through trial and Error method. Because of the great GUI the software seemed easy to use and we both aided each other on ways to accomplish the given task.

As for the second assignment coding was done by Greeshma Jeev and the Analysis part was done by Olayemi. We both went through the outputs and the analysis to make our own suggestions to the results inorder to make this report a grand success.